Abstract

In this note we prove the dual representation formula of the
divergence between two distributions in a parametric model.
Resulting estimators for the divergence as well as for the parameter
are derived. These estimators do not make use of any grouping or
smoothing. It is proved that all differentiable divergences induce
the same estimator of the parameter on any regular exponential
family, which is nothing else but the MLE.

1 Introduction

1.1 Context and scope of this note

This note presents a short proof of the duality formula for
$\varphi$-divergences defined through differentiable convex functions
$\varphi$ in parametric models, and discusses an unexpected phenomenon
in the context of exponential families. First versions of this formula
appear in [8] p. 33, in [1] in the context of the Kullback-Leibler
divergence, and in [7] in a general form. The paper [3] introduces this
form in the context of minimal $\chi^2$-estimation; a global approach to
this formulation is presented in Broniatowski and Keziou (2006) [2].
Independently, Liese and Vajda (2006) [9] have obtained a similar
expression based on a much simpler argument than those presented in all
the above mentioned papers (formula (118) in their paper); however the
proof of their result is merely sketched, and we have found it useful to
present a complete treatment of this interesting result in the
parametric setting, in contrast with the aforementioned approaches.

The main interest of the resulting expression is that it leads to a wide
variety of estimators, by a plug-in method of the empirical measure
evaluated on the current data set; so, each type of sampling has its own
estimators and inference procedures, for any $\varphi$-divergence
criterion. In the case of simple i.i.d. sampling, the resulting
properties of those estimators and of the subsequent inferential
procedures are studied in [1].

A striking fact is that all minimum divergence estimators defined
through this dual formula coincide with the MLE in exponential families.
They hence enjoy strong optimality under the standard exponential
models, while leading to estimators different from the MLE under models
other than the exponential ones. This result also shows that MLE's of
parameters of exponential families are strongly motivated, being
generated by the whole continuum of $\varphi$-divergences.

This note results from joint cooperation with the late Igor Vajda.

1.2 Notation 

Let $\mathcal{P} := \{P_\theta,\ \theta \in \Theta\}$ be an identifiable
parametric model on $\mathbb{R}^d$ where $\Theta$ is a subset of
$\mathbb{R}^s$. All measures in $\mathcal{P}$ will be assumed to be
measure equivalent, sharing therefore the same support. The parameter
space $\Theta$ need not be open in the present setting. It may even
happen that the model includes measures which are not probability
distributions; cases of interest cover models including mixtures of
probability distributions; see [1]. Let $\varphi$ be a proper closed
convex function from $]-\infty,+\infty[$ to $[0,+\infty]$ with
$\varphi(1) = 0$ and such that its domain
$\mathrm{dom}\,\varphi := \{x \in \mathbb{R} \text{ such that } \varphi(x) < \infty\}$
is an interval with endpoints $a_\varphi < 1 < b_\varphi$ (which may be
finite or infinite). For two measures $P_\alpha$ and $P_\theta$ in
$\mathcal{P}$ the $\varphi$-divergence between $P_\alpha$ and $P_\theta$
is defined by

$$\phi(\alpha,\theta) := \int \varphi\left(\frac{dP_\alpha}{dP_\theta}(x)\right) dP_\theta(x).$$

In a broader context, the $\varphi$-divergences were introduced by [5]
as "f-divergences". The basic property of $\varphi$-divergences states
that when $\varphi$ is strictly convex on a neighborhood of $x = 1$, then

$$\phi(\alpha,\theta) = 0 \text{ if and only if } \alpha = \theta.$$

We refer to [8] chapter 1 for a complete study of those properties. Let
us simply quote that in general $\phi(\alpha,\theta)$ and
$\phi(\theta,\alpha)$ are not equal. Hence, $\varphi$-divergences
usually are not distances, but they merely measure some difference
between two measures. A main feature of divergences between
distributions of random variables $X$ and $Y$ is the invariance property
with respect to a common smooth change of variables.


1.3 Examples of $\varphi$-divergences

The Kullback-Leibler ($KL$), modified Kullback-Leibler ($KL_m$),
$\chi^2$, modified $\chi^2$ ($\chi^2_m$), Hellinger ($H$), and $L_1$
divergences are respectively associated to the convex functions
$\varphi(x) = x\log x - x + 1$, $\varphi(x) = -\log x + x - 1$,
$\varphi(x) = \frac{1}{2}(x-1)^2$, $\varphi(x) = \frac{1}{2}(x-1)^2/x$,
$\varphi(x) = 2(\sqrt{x}-1)^2$ and $\varphi(x) = |x-1|$. All these
divergences except the $L_1$ one belong to the class of the so-called
"power divergences" introduced in [6] (see also [8] chapter 2), a class
which takes its origin from Rényi [10]. They are defined through the
class of convex functions

$$x \in\, ]0,+\infty[\ \mapsto\ \varphi_\gamma(x) := \frac{x^\gamma - \gamma x + \gamma - 1}{\gamma(\gamma-1)} \qquad (1)$$

if $\gamma \in \mathbb{R} \setminus \{0,1\}$,
$\varphi_0(x) := -\log x + x - 1$ and $\varphi_1(x) := x\log x - x + 1$.
So, the $KL$-divergence is associated to $\varphi_1$, the $KL_m$ to
$\varphi_0$, the $\chi^2$ to $\varphi_2$, the $\chi^2_m$ to
$\varphi_{-1}$ and the Hellinger distance to $\varphi_{1/2}$.
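
The correspondence between the power family (1) and the named
divergences can be checked numerically. The following is a minimal
sketch, not part of the original note: the function name `phi_gamma`
and the test points are illustrative choices, and only $x > 0$ is
covered.

```python
import math

def phi_gamma(x: float, gamma: float) -> float:
    """Power-divergence generator phi_gamma(x) of formula (1), for x > 0."""
    if x <= 0:
        raise ValueError("this sketch only covers x > 0")
    if gamma == 0:            # modified Kullback-Leibler generator
        return -math.log(x) + x - 1.0
    if gamma == 1:            # Kullback-Leibler generator
        return x * math.log(x) - x + 1.0
    return (x**gamma - gamma * x + gamma - 1.0) / (gamma * (gamma - 1.0))

# Closed forms quoted in the text, keyed by the matching value of gamma.
named = {
    2.0: lambda x: 0.5 * (x - 1.0) ** 2,              # chi-square
    -1.0: lambda x: 0.5 * (x - 1.0) ** 2 / x,         # modified chi-square
    0.5: lambda x: 2.0 * (math.sqrt(x) - 1.0) ** 2,   # Hellinger
}

# Compare formula (1) with each closed form at a few test points.
for gamma, closed_form in named.items():
    for x in (0.25, 1.0, 3.0):
        assert math.isclose(phi_gamma(x, gamma), closed_form(x), abs_tol=1e-12)
```

Every generator in the family vanishes at $x = 1$, in line with the
requirement $\varphi(1) = 0$.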

It may be convenient to extend the definition of the power divergences
in such a way that $\phi(\alpha,\theta)$ may be defined (possibly
infinite) even when $P_\alpha$ or $P_\theta$ is not a probability
measure. This is achieved setting

$$x \in\, ]-\infty,+\infty[\ \mapsto\ \begin{cases} \varphi_\gamma(x) & \text{if } x \in [0,+\infty[, \\ +\infty & \text{if } x \in\, ]-\infty,0[, \end{cases} \qquad (2)$$

when $\mathrm{dom}\,\varphi = \mathbb{R}^+ \setminus \{0\}$. Note that
for the $\chi^2$-divergence, the corresponding function
$\varphi_2(x) := \frac{1}{2}(x-1)^2$ is defined and convex on the whole
of $\mathbb{R}$.

We will only consider divergences defined through differentiable
functions $\varphi$, which we assume to satisfy

(RC) There exists a positive $\delta$ such that for all $c$ in
     $[1-\delta, 1+\delta]$, we can find numbers $c_1, c_2, c_3$ such that
     $\varphi(cx) \leq c_1 \varphi(x) + c_2 |x| + c_3$, for all real $x$.

Condition (RC) holds for all power divergences including $KL$ and
$KL_m$ divergences.

2 Dual form of the divergence and dual estimators 
in parametric models 

Let $\theta$ and $\theta_T$ be any parameters in $\Theta$. We intend to
provide a new expression for $\phi(\theta,\theta_T)$.

By strict convexity, for all $a$ and $b$ in the domain of $\varphi$ it
holds

$$\varphi(b) \geq \varphi(a) + \varphi'(a)(b - a) \qquad (3)$$

with equality if and only if $a = b$.
Denote

$$\varphi^{\#}(x) := x\varphi'(x) - \varphi(x).$$
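
As a worked illustration (not in the original note), the derivative and
the transform $\varphi^{\#}$ can be written out for the power family
(1). These expressions follow by direct differentiation and are stated
here only as a convenience for the reader.

```latex
% Power family, \gamma \notin \{0, 1\}:
\varphi_\gamma'(x) = \frac{x^{\gamma-1} - 1}{\gamma - 1}, \qquad
\varphi_\gamma^{\#}(x) = x\,\varphi_\gamma'(x) - \varphi_\gamma(x)
                       = \frac{x^{\gamma} - 1}{\gamma}.

% Limiting cases:
\varphi_0'(x) = 1 - \frac{1}{x}, \quad \varphi_0^{\#}(x) = \log x,
\qquad
\varphi_1'(x) = \log x, \quad \varphi_1^{\#}(x) = x - 1.
```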
For any $\alpha$ in $\Theta$ define

$$a := \frac{dP_\theta}{dP_\alpha}(x), \qquad b := \frac{dP_\theta}{dP_{\theta_T}}(x).$$



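Inserting these values in (3), which can be written
$\varphi(b) \geq \varphi'(a)\, b - \varphi^{\#}(a)$, and integrating with
respect to $P_{\theta_T}$ yields

$$\phi(\theta,\theta_T) \geq \int \left[ \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) \frac{dP_\theta}{dP_{\theta_T}} - \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) \right] dP_{\theta_T}.$$

Assume at present that this entails

$$\phi(\theta,\theta_T) \geq \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} \qquad (4)$$

for suitable $\alpha$'s in some set $\mathcal{F}_\theta$ included in
$\Theta$.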

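When $\alpha = \theta_T$ the inequality in (4) turns to equality, which
yields

$$\phi(\theta,\theta_T) = \sup_{\alpha \in \mathcal{F}_\theta} \left\{ \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} \right\}. \qquad (5)$$

Denote

$$h(\theta,\alpha,x) := \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}(x)\right) \qquad (6)$$

from which

$$\phi(\theta,\theta_T) = \sup_{\alpha \in \mathcal{F}_\theta} \int h(\theta,\alpha,x)\, dP_{\theta_T}. \qquad (7)$$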



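Furthermore by (4), for all suitable $\alpha$

$$\phi(\theta,\theta_T) - \int h(\theta,\alpha,x)\, dP_{\theta_T} = \int h(\theta,\theta_T,x)\, dP_{\theta_T} - \int h(\theta,\alpha,x)\, dP_{\theta_T} \geq 0$$

and the function $x \mapsto h(\theta,\theta_T,x) - h(\theta,\alpha,x)$
is nonnegative, due to (3). It follows that
$\phi(\theta,\theta_T) - \int h(\theta,\alpha,x)\, dP_{\theta_T}$ is zero
only if $h(\theta,\alpha,x) = h(\theta,\theta_T,x)$, $P_{\theta_T}$-a.e.
Therefore for any $x$ in the support of $P_{\theta_T}$

$$\int \varphi'\left(\frac{dP_\theta}{dP_{\theta_T}}\right) dP_\theta - \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta = \varphi^{\#}\left(\frac{dP_\theta}{dP_{\theta_T}}(x)\right) - \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}(x)\right),$$

which cannot hold for all $x$ when the functions
$x \mapsto \varphi^{\#}\left(\frac{dP_\theta}{dP_{\theta_T}}(x)\right)$,
$x \mapsto \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}(x)\right)$ and
$1$ are linearly independent, unless $\alpha = \theta_T$. We have proved
that $\theta_T$ is the unique optimizer in (5).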

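We have skipped some sufficient conditions which ensure that (4)
holds.

Assume that

$$\int \left| \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) \right| dP_\theta < \infty. \qquad (8)$$

Assume further that $\phi(\theta,\theta_T)$ is finite. Since, by (3),

$$-\varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) \leq \varphi\left(\frac{dP_\theta}{dP_{\theta_T}}\right) - \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) \frac{dP_\theta}{dP_{\theta_T}},$$

integrating with respect to $P_{\theta_T}$ gives

$$-\int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} \leq \phi(\theta,\theta_T) - \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta,$$

which entails (4). When
$\int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} = +\infty$
then clearly, under (8), the right-hand side of (4) equals $-\infty$ and
(4) holds.

We have proved that (5) holds when $\alpha$ satisfies (8).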

Sufficient and simple conditions encompassing (8) can be assessed
under standard requirements for nearly all divergences. We state the
following Lemma (see Liese and Vajda (1987) [8] and Broniatowski and
Keziou (2006) [2], Lemma 3.2).

Lemma 1 Assume that RC holds and $\phi(\theta,\alpha)$ is finite. Then
(8) holds.

Summing up, we state 

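Theorem 2 Let $\theta$ belong to $\Theta$ and let
$\phi(\theta,\theta_T)$ be finite. Assume that RC holds. Let
$\mathcal{F}_\theta$ be the subset of all $\alpha$'s in $\Theta$ such
that $\phi(\theta,\alpha)$ is finite. Then

$$\phi(\theta,\theta_T) = \sup_{\alpha \in \mathcal{F}_\theta} \left\{ \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} \right\}.$$

Furthermore the sup is reached at $\theta_T$ and uniqueness holds.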
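
The dual representation can be checked numerically on a toy model where
all integrals reduce to finite sums. The sketch below is an
illustration only: the Bernoulli model, the grid over $\alpha$, the
choice $\gamma = 2$ and all function names are assumptions made for
this example, not part of the note.

```python
def phi(x, g):
    """Power-divergence generator phi_gamma, for gamma not in {0, 1}."""
    return (x**g - g * x + g - 1.0) / (g * (g - 1.0))

def phi_prime(x, g):
    """Derivative of phi_gamma."""
    return (x ** (g - 1.0) - 1.0) / (g - 1.0)

def phi_sharp(x, g):
    """phi^#(x) = x * phi'(x) - phi(x), which simplifies to (x**g - 1) / g."""
    return (x**g - 1.0) / g

def pmf(p):
    """Bernoulli(p) probability mass function on {0, 1} (toy model)."""
    return {0: 1.0 - p, 1: p}

def divergence(theta, theta_t, g):
    """phi(theta, theta_T): sum over x of phi(p_theta / p_thetaT) * p_thetaT."""
    pt, pT = pmf(theta), pmf(theta_t)
    return sum(phi(pt[x] / pT[x], g) * pT[x] for x in (0, 1))

def dual_objective(theta, alpha, theta_t, g):
    """Integral of h(theta, alpha, x) with respect to P_thetaT."""
    pt, pa, pT = pmf(theta), pmf(alpha), pmf(theta_t)
    first = sum(phi_prime(pt[x] / pa[x], g) * pt[x] for x in (0, 1))
    second = sum(phi_sharp(pt[x] / pa[x], g) * pT[x] for x in (0, 1))
    return first - second

theta, theta_t, g = 0.3, 0.6, 2.0
grid = [i / 100 for i in range(1, 100)]          # candidate values of alpha
values = [dual_objective(theta, a, theta_t, g) for a in grid]
best_alpha = grid[values.index(max(values))]     # grid maximizer of the dual
```

On this grid the maximizer is the true parameter $\theta_T = 0.6$ and
the maximal value agrees with the divergence computed directly, while
every other $\alpha$ gives a smaller value, as inequality (4) requires.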
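For the Cressie-Read family of divergences with $\gamma \neq 0, 1$, and
when $P_\theta$ and $P_{\theta_T}$ are probability measures, this
representation writes

$$\phi_\gamma(\theta,\theta_T) = \sup_{\alpha \in \mathcal{F}_\theta} \left\{ \frac{1}{\gamma-1} \int \left(\frac{dP_\theta}{dP_\alpha}\right)^{\gamma-1} dP_\theta - \frac{1}{\gamma} \int \left(\frac{dP_\theta}{dP_\alpha}\right)^{\gamma} dP_{\theta_T} - \frac{1}{\gamma(\gamma-1)} \right\}.$$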

The set $\mathcal{F}_\theta$ may depend on the choice of the parameter
$\theta$. Such is the case for the $\chi^2$ divergence, i.e.
$\varphi(x) = (x-1)^2/2$, when
$p_\theta(x) = \theta \exp(-\theta x) 1_{[0,\infty)}(x)$. In most cases
the difficulty of dealing with a specific set $\mathcal{F}_\theta$
depending on $\theta$ can be overcome when

(A) There exists a neighborhood $U$ of $\theta_T$ for which
    $\phi(\theta,\theta')$ is finite whatever $\theta$ and $\theta'$ in $U$

which for example holds in the above case for any $\theta_T$. This
simplification deserves to be stated in the next result.

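Theorem 3 When $\phi(\theta,\theta_T)$ is finite and RC holds, then
under condition (A)

$$\phi(\theta,\theta_T) = \sup_{\alpha \in U} \left\{ \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \int \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}\right) dP_{\theta_T} \right\}.$$

Furthermore the sup is reached at $\theta_T$ and uniqueness holds.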
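
The dependence of $\mathcal{F}_\theta$ on $\theta$ in the exponential
density example quoted before Theorem 3 can be made explicit. The
computation below is a worked illustration added for the reader; it
uses only the $\chi^2$ generator and the density
$p_\theta(x) = \theta \exp(-\theta x)$ on $[0,\infty)$.

```latex
% chi-square divergence between P_theta and P_alpha:
\phi(\theta,\alpha)
  = \frac{1}{2}\int \left(\frac{p_\theta}{p_\alpha} - 1\right)^2 p_\alpha \, dx
  = \frac{1}{2}\left(\int_0^\infty \frac{p_\theta^2}{p_\alpha}\, dx - 1\right),
\qquad
\int_0^\infty \frac{p_\theta^2}{p_\alpha}\, dx
  = \frac{\theta^2}{\alpha}\int_0^\infty e^{-(2\theta-\alpha)x}\, dx
  = \frac{\theta^2}{\alpha(2\theta-\alpha)}
  \quad \text{if } \alpha < 2\theta .

% The integral diverges when \alpha \ge 2\theta, hence
\mathcal{F}_\theta = \{\alpha > 0 : \alpha < 2\theta\}.
```

Any neighborhood $U = ]u_1, u_2[$ of $\theta_T$ with $u_2 < 2u_1$ then
satisfies condition (A).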

Remark 4 Identifying $\mathcal{F}_\theta$ might be cumbersome. This
difficulty also appears in the classical MLE case, a special case of the
above statement with divergence function $\varphi_0$, for which it is
assumed that

$$\int \log p_\theta(x)\, p_{\theta_T}(x)\, d\lambda(x) \text{ is finite}$$

for $\theta$ in a neighborhood of $\theta_T$.

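Under the above notation and hypotheses define

$$T_\theta(P_{\theta_T}) := \arg\sup_{\alpha \in \mathcal{F}_\theta} \int h(\theta,\alpha,x)\, dP_{\theta_T}. \qquad (9)$$

It then holds

$$T_\theta(P_{\theta_T}) = \theta_T$$

for all $\theta_T$ in $\Theta$. Also let

$$S(P_{\theta_T}) := \arg\inf_{\theta \in \Theta} \sup_{\alpha \in \mathcal{F}_\theta} \int h(\theta,\alpha,x)\, dP_{\theta_T} \qquad (10)$$

which also satisfies

$$S(P_{\theta_T}) = \theta_T$$

for all $\theta_T$ in $\Theta$. We thus state

Theorem 5 When $\phi(\theta,\theta_T)$ is finite for all $\theta$ in
$\Theta$ and RC holds, both functionals $T_\theta$ and $S$ are Fisher
consistent for all $\theta_T$ in $\Theta$.
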
3 Plug-in estimators

From (7) simple estimators for $\theta_T$ can be defined, plugging any
convergent empirical measure in place of $P_{\theta_T}$ and taking the
infimum in $\theta$ in the resulting estimator of
$\phi(\theta,\theta_T)$.

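In the context of simple i.i.d. sampling, introducing the empirical
measure

$$P_n := \frac{1}{n} \sum_{i=1}^{n} \delta_{X_i}$$

where the $X_i$'s are i.i.d. r.v's with common unknown distribution
$P_{\theta_T}$ in $\mathcal{P}$, the natural estimator of
$\phi(\theta,\theta_T)$ is

$$\phi_n(\theta,\theta_T) := \sup_{\alpha \in \mathcal{F}_\theta} \left\{ \int h(\theta,\alpha,x)\, dP_n(x) \right\}. \qquad (11)$$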



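Since

$$\inf_{\theta \in \Theta} \phi(\theta,\theta_T) = \phi(\theta_T,\theta_T) = 0$$

the resulting estimator of $\phi(\theta_T,\theta_T)$ is

$$\phi_n(\theta_T,\theta_T) := \inf_{\theta \in \Theta} \phi_n(\theta,\theta_T) = \inf_{\theta \in \Theta} \sup_{\alpha \in \mathcal{F}_\theta} \left\{ \int h(\theta,\alpha,x)\, dP_n(x) \right\}. \qquad (12)$$

Also the estimator of $\theta_T$ is obtained as

$$\hat{\theta} := \arg\inf_{\theta \in \Theta} \sup_{\alpha \in \mathcal{F}_\theta} \left\{ \int h(\theta,\alpha,x)\, dP_n(x) \right\}. \qquad (13)$$

When (A) holds then $\mathcal{F}_\theta$ may be substituted by $U$ in
the above definitions.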
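
The definitions (11) and (13) can be sketched in code on a toy model.
The example below is an illustration under stated assumptions, not the
procedure studied in the cited papers: it uses a Bernoulli model, a
power divergence with $\gamma = 1/2$, and a plain grid search for both
the supremum and the infimum; all names are hypothetical.

```python
def phi_prime(x, g):
    """Derivative of the power-divergence generator, gamma not in {0, 1}."""
    return (x ** (g - 1.0) - 1.0) / (g - 1.0)

def phi_sharp(x, g):
    """phi^#(x) = x * phi'(x) - phi(x) = (x**g - 1) / g for the power family."""
    return (x**g - 1.0) / g

def pmf(p):
    """Bernoulli(p) probability mass function on {0, 1} (toy model)."""
    return {0: 1.0 - p, 1: p}

def dual_criterion(theta, alpha, sample, g):
    """Integral of h(theta, alpha, .) against the empirical measure P_n."""
    pt, pa = pmf(theta), pmf(alpha)
    first = sum(phi_prime(pt[x] / pa[x], g) * pt[x] for x in (0, 1))
    second = sum(phi_sharp(pt[x] / pa[x], g) for x in sample) / len(sample)
    return first - second

def phi_n(theta, sample, g, grid):
    """Estimator (11), with the sup over alpha taken on a finite grid."""
    return max(dual_criterion(theta, a, sample, g) for a in grid)

def dual_estimator(sample, g, grid):
    """Estimator (13): grid value of theta minimizing the sup over alpha."""
    return min(grid, key=lambda t: phi_n(t, sample, g, grid))

sample = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]    # 3 successes out of 10 draws
grid = [i / 20 for i in range(1, 20)]      # 0.05, 0.10, ..., 0.95
theta_hat = dual_estimator(sample, g=0.5, grid=grid)
```

In this saturated model the empirical measure itself belongs to the
model, so the estimator returns the sample proportion; a continuous
model would require numerical integration for the first term and a
proper optimizer in place of the grid.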

The resulting minimum dual divergence estimators (12) and (13)
do not require any smoothing or grouping, in contrast with the classical
approach which involves quantization. The paper [1] provides a complete
study of those estimates and subsequent inference tools for the usual
i.i.d. sample scheme. For all divergences considered here, these
estimators are asymptotically efficient in the sense that they achieve
the Cramer-Rao bound asymptotically. The case when
$\varphi = \varphi_0$ leads to $\theta_{ML}$, defined as the celebrated
Maximum Likelihood Estimator (MLE), in the context of simple sampling.
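
The reduction to maximum likelihood for $\varphi = \varphi_0$ can be
written out explicitly. The derivation below is a worked illustration
added for the reader; it assumes that all $P_\theta$ are probability
measures with densities $p_\theta$ with respect to a common dominating
measure $\lambda$, and that $\mathcal{F}_\theta = \Theta$.

```latex
% For \varphi_0(x) = -\log x + x - 1:
\varphi_0'(x) = 1 - \frac{1}{x}, \qquad \varphi_0^{\#}(x) = \log x,
\qquad
\int \varphi_0'\!\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta
  = \int \left(p_\theta - p_\alpha\right) d\lambda = 0 .

% Hence the plug-in criterion is a difference of log-likelihoods:
\int h(\theta,\alpha,x)\, dP_n(x)
  = \frac{1}{n}\sum_{i=1}^{n} \log p_\alpha(X_i)
  - \frac{1}{n}\sum_{i=1}^{n} \log p_\theta(X_i),

% and the inf-sup in (13) is attained when \theta maximizes the likelihood:
\hat{\theta}
  = \arg\inf_{\theta}\left\{ \sup_{\alpha}\frac{1}{n}\sum_{i=1}^{n}\log p_\alpha(X_i)
  - \frac{1}{n}\sum_{i=1}^{n}\log p_\theta(X_i) \right\}
  = \arg\sup_{\theta}\frac{1}{n}\sum_{i=1}^{n}\log p_\theta(X_i)
  = \theta_{ML}.
```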
4 Minimum divergence estimators in exponential
families

In this section we prove the following result.

Theorem 6 For every divergence $\phi$ defined through a differentiable
function $\varphi$ satisfying Condition (RC), the minimum dual
divergence estimator defined by (13) coincides with the MLE on any full
exponential family such that $\phi(\theta,\alpha)$ is finite for all
$\theta$ and $\alpha$ in $\Theta$.

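Let $\mathcal{P}$ be an exponential family on $\mathbb{R}^s$ with
canonical parameter in $\mathbb{R}^d$,

$$\mathcal{P} := \left\{ P_\theta \text{ such that } p_\theta(x) = \frac{dP_\theta}{d\lambda}(x) = \exp\left[T(x)'\theta - C(\theta)\right];\ \theta \in \Theta \right\}$$

where $x$ is in $\mathbb{R}^s$, $\Theta$ is an open subset of
$\mathbb{R}^d$, and $\lambda$ is a dominating measure for
$\mathcal{P}$. We assume $\mathcal{P}$ to be full, namely that the
Hessian matrix $(\partial^2/\partial\theta^2)\, C(\theta)$ is positive
definite for all $\theta$ in $\Theta$.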

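Let $X_1, \ldots, X_n$ be $n$ i.i.d. random variables with common
distribution $P_{\theta_T}$ with $\theta_T$ in $\Theta$. Introduce

$$M_n(\theta,\alpha) := \int \varphi'\left(\frac{dP_\theta}{dP_\alpha}\right) dP_\theta - \frac{1}{n} \sum_{i=1}^{n} \varphi^{\#}\left(\frac{dP_\theta}{dP_\alpha}(X_i)\right).$$

We will prove that

$$\inf_\theta \sup_\alpha M_n(\theta,\alpha) = 0 \qquad (14)$$

whatever the function $\varphi$ satisfying the claim. In (14) $\theta$
and $\alpha$ run in $\Theta$. This result extends the maximum likelihood
case for which

$$\inf_\theta \sup_\alpha M_n(\theta,\alpha) = -\sup_\theta \inf_\alpha \left[ \frac{1}{n} \sum_{i=1}^{n} \log p_\theta(X_i) - \frac{1}{n} \sum_{i=1}^{n} \log p_\alpha(X_i) \right] = 0.$$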
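Direct substitution shows that for any $\theta$,

$$\sup_\alpha M_n(\theta,\alpha) \geq M_n(\theta,\theta) = 0$$

from which

$$\inf_\theta \sup_\alpha M_n(\theta,\alpha) \geq 0. \qquad (15)$$

We prove that

$$\alpha = \theta_{ML} \text{ is the unique maximizer of } M_n(\theta_{ML},\alpha) \qquad (16)$$

which yields

$$\inf_\theta \sup_\alpha M_n(\theta,\alpha) \leq \sup_\alpha M_n(\theta_{ML},\alpha) = M_n(\theta_{ML},\theta_{ML}) = 0 \qquad (17)$$

which together with (15) completes the proof.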
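Define

$$M_{n,1}(\theta,\alpha) := \int \varphi'\left(\exp A(\theta,\alpha,x)\right) \exp B(\theta,x)\, d\lambda(x)$$

$$M_{n,2}(\theta,\alpha) := \frac{1}{n} \sum_{i=1}^{n} \exp\left(A(\theta,\alpha,X_i)\right) \varphi'\left(\exp A(\theta,\alpha,X_i)\right)$$

$$M_{n,3}(\theta,\alpha) := \frac{1}{n} \sum_{i=1}^{n} \varphi\left(\exp A(\theta,\alpha,X_i)\right)$$

with

$$A(\theta,\alpha,x) := T(x)'(\theta - \alpha) + C(\alpha) - C(\theta)$$
$$B(\theta,x) := T(x)'\theta - C(\theta).$$

It holds

$$M_n(\theta,\alpha) = M_{n,1}(\theta,\alpha) - M_{n,2}(\theta,\alpha) + M_{n,3}(\theta,\alpha)$$

with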

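$$\frac{\partial}{\partial\alpha} M_{n,1}(\theta,\alpha)\Big|_{\alpha=\theta} = \varphi^{(2)}(1) \left[ \nabla C(\alpha)\big|_{\alpha=\theta} - \nabla C(\theta) \right] = 0$$

for all $\theta$,

$$\frac{\partial}{\partial\alpha} M_{n,2}(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = \left(\varphi'(1) + \varphi^{(2)}(1)\right) \left[ \nabla C(\alpha) - \frac{1}{n} \sum_{i=1}^{n} T(X_i) \right]_{\alpha=\theta_{ML}} = 0$$

and

$$\frac{\partial}{\partial\alpha} M_{n,3}(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = \varphi'(1) \left[ \nabla C(\alpha) - \frac{1}{n} \sum_{i=1}^{n} T(X_i) \right]_{\alpha=\theta_{ML}} = 0$$

where the bracket in the two last displays vanishes iff
$\alpha = \theta_{ML}$, by the likelihood equation
$\nabla C(\theta_{ML}) = \frac{1}{n}\sum_{i=1}^{n} T(X_i)$. Now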

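$$\frac{\partial^2}{\partial\alpha^2} M_{n,1}(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = \left(\varphi^{(3)}(1) + 2\varphi^{(2)}(1)\right) (\partial^2/\partial\theta^2)\, C(\theta_{ML})$$

$$\frac{\partial^2}{\partial\alpha^2} M_{n,2}(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = \left(\varphi^{(3)}(1) + 4\varphi^{(2)}(1)\right) (\partial^2/\partial\theta^2)\, C(\theta_{ML})$$

$$\frac{\partial^2}{\partial\alpha^2} M_{n,3}(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = \varphi^{(2)}(1)\, (\partial^2/\partial\theta^2)\, C(\theta_{ML}),$$

whence

$$\frac{\partial}{\partial\alpha} M_n(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = 0$$

$$\frac{\partial^2}{\partial\alpha^2} M_n(\theta_{ML},\alpha)\Big|_{\alpha=\theta_{ML}} = -\varphi^{(2)}(1)\, (\partial^2/\partial\theta^2)\, C(\theta_{ML}),$$

which proves (16), and closes the proof.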
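
Two ingredients of the proof, the identity $M_n(\theta,\theta) = 0$
behind (15) and the first-order condition at $\theta_{ML}$, can be
checked numerically. The sketch below is an illustration under
assumptions not made in the note: the exponential density
$p_\theta(x) = \theta e^{-\theta x}$, the $\chi^2$ generator (for which
the integral has a closed form when $\alpha < 2\theta$), a hypothetical
data set, and a finite-difference derivative. It does not examine the
second-order condition or global maximality.

```python
import math

def M_n(theta, alpha, data):
    """Dual criterion M_n(theta, alpha) for phi(x) = (x - 1)**2 / 2 and the
    density theta * exp(-theta * x), x >= 0. The closed form of the first
    term is only valid for 0 < alpha < 2 * theta."""
    if not 0.0 < alpha < 2.0 * theta:
        raise ValueError("need 0 < alpha < 2 * theta")
    # Integral of phi'(p_theta / p_alpha) dP_theta, with phi'(x) = x - 1.
    first = theta**2 / (alpha * (2.0 * theta - alpha)) - 1.0
    # Empirical mean of phi^#(p_theta / p_alpha), phi^#(x) = (x**2 - 1) / 2.
    ratios = [(theta / alpha) * math.exp(-(theta - alpha) * x) for x in data]
    second = sum((r * r - 1.0) / 2.0 for r in ratios) / len(data)
    return first - second

def d_alpha(theta, alpha, data, step=1e-6):
    """Central finite difference of M_n with respect to alpha."""
    up = M_n(theta, alpha + step, data)
    down = M_n(theta, alpha - step, data)
    return (up - down) / (2.0 * step)

data = [0.2, 1.5, 0.7, 3.1, 0.4, 1.1]      # hypothetical sample
theta_ml = len(data) / sum(data)           # MLE of the rate: 1 / sample mean

value_at_ml = M_n(theta_ml, theta_ml, data)        # M_n(theta, theta) = 0
slope_at_ml = d_alpha(theta_ml, theta_ml, data)    # vanishes at the MLE
slope_elsewhere = d_alpha(0.5, 0.5, data)          # equals 1/theta - mean
```

For $\theta \neq \theta_{ML}$ the slope at $\alpha = \theta$ is nonzero,
so $\sup_\alpha M_n(\theta,\alpha)$ is strictly positive there, in line
with (15).
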
References 

[1] Broniatowski, M. Estimation of the Kullback-Leibler divergence.
    Math. Methods Statist. 12 (2003), no. 4, 391-409.

[2] Broniatowski, M.; Keziou, A. Minimization of $\varphi$-divergences
    on sets of signed measures. Studia Sci. Math. Hungar. 43 (2006),
    no. 4, 403-442.

[3] Broniatowski, M.; Leorato, S. An estimation method for the Neyman
    chi-square divergence with application to test of hypotheses.
    J. Multivariate Anal. 97 (2006), no. 6, 1409-1436.

[4] Broniatowski, M.; Keziou, A. Parametric estimation and tests
    through divergences and the duality technique. J. Multivariate
    Anal. 100 (2009), no. 1, 16-36.

[5] Csiszár, I. Eine informationstheoretische Ungleichung und ihre
    Anwendung auf den Beweis der Ergodizität von Markoffschen Ketten.
    (German) Magyar Tud. Akad. Mat. Kutató Int. Közl. 8 (1963), 85-108.

[6] Read, T. R. C.; Cressie, N. A. C. Goodness-of-fit statistics for
    discrete multivariate data. Springer Series in Statistics.
    Springer-Verlag, New York, 1988. xii+211 pp. ISBN: 0-387-96682-X.

[7] Keziou, A. Dual representation of $\varphi$-divergences and
    applications. C. R. Math. Acad. Sci. Paris 336 (2003), no. 10,
    857-862.

[8] Liese, F.; Vajda, I. Convex statistical distances. Teubner-Texte zur
    Mathematik [Teubner Texts in Mathematics], 95. BSB B. G. Teubner
    Verlagsgesellschaft, Leipzig, 1987. ISBN: 3-322-00428-7.

[9] Liese, F.; Vajda, I. On divergences and informations in statistics
    and information theory. IEEE Trans. Inform. Theory 52 (2006),
    no. 10, 4394-4412.

[10] Rényi, A. On measures of entropy and information. 1961 Proc. 4th
     Berkeley Sympos. Math. Statist. and Prob., Vol. I, pp. 547-561.
     Univ. California Press, Berkeley, Calif.






